Multi-stage large send offload

ABSTRACT

A network stack sends very large packets with large segment offload (LSO) by performing multi-pass LSO. A first-stage LSO filter is inserted between the network stack and the physical NIC. The first-stage filter splits very large LSO packets into LSO packets that are small enough for the NIC. The NIC then performs a second pass of LSO by splitting these sub-packets into standard MTU-sized networking packets for transmission on the network.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 12/722,434, Attorney Docket No. 328846.01/104709.000629, entitled “MULTI-STAGE LARGE SEND OFFLOAD”, filed on Mar. 11, 2010, the entirety of which is incorporated herein by reference.

BACKGROUND

When a guest computer system is emulated on a host computer system, the guest computer system is called a “virtual machine” as the guest computer system only exists in the host computer system as a software representation of the operation of one specific hardware configuration that may diverge from the native machine. The virtual machine presents to the software operating on the virtual machine an emulated hardware configuration.

A virtual machine management system (sometimes referred to as a virtual machine monitor or a hypervisor) is also often employed to manage one or more virtual machines so that multiple virtual machines can run on a single computing device concurrently. The virtual machine management system runs directly on the native hardware and virtualizes the resources of the machine by exposing interfaces to virtual machines for access to the underlying hardware. A host operating system and a virtual machine management system may run side-by-side on the same physical hardware. For purposes of clarity, we will use the abbreviation VMM to refer to all incarnations of a virtual machine management system.

One problem that occurs in the operating system virtualization context relates to computing resources such as data storage devices, data input and output devices, networking devices, etc. Because each of the host computing device's multiple operating systems may have different functionality, there is a question as to which computing resources should be apportioned to which of the multiple operating systems. For example, a virtualized host computing device may include only a single network interface card (NIC) that enables the host computing device to communicate with other networked computers. This scenario raises the question of which of the multiple operating systems on the virtualized host should be permitted to interact with and control the NIC.

When one of the operating systems controls the NIC, the other operating systems send their packets to the network through the operating system that controls the NIC. In such a case, the packet size accepted by the NIC may not be known. However, sending network TCP packets through a network stack is computationally expensive. Resources must be allocated for each packet, and each component in the networking stack typically examines each packet. This problem is compounded in a virtualization environment, because each packet is also transferred between the guest VM and the root operating system. This entails a fixed overhead per packet that can be quite large. On the other hand, the networking stack packet size is normally limited by the maximum transmission unit (MTU) size of the connection, e.g., 1500 bytes. It is not typically feasible to increase the MTU size since it is limited by network infrastructure.

Hardware NICs provide a feature called “Large Send Offload” (LSO) that allows larger TCP packets to travel through the stack all the way to the NIC. Since most of the cost per packet is fixed, this does a fairly good job, but NICs typically support packets that are fairly small, around 62 KB. There is a need for the transmission between operating systems of larger packets to reduce overhead.

SUMMARY

The embodiments described allow a network stack to send very large packets, larger than a physical NIC typically supports with large segment offload (LSO). In general, this is accomplished by performing multi-pass LSO. A first-stage LSO filter is inserted somewhere between the network stack and the physical NIC. The first-stage filter splits very large LSO packets into LSO packets that are small enough for the NIC. The NIC then performs a second pass of LSO by splitting these sub-packets into standard MTU-sized networking packets for transmission on the network.

To that end, a first operating system operating on a computing device receives an indicator of a first LSO packet size. The first LSO packet size is a multiple of a second LSO packet size that is supported by a network interface card connected to the computing device. The first operating system formats data (e.g., from an application) into a first packet of a first LSO packet size. The first packet is then transferred to a second operating system on the same computing device that has access to a network interface card. The first packet is then split on the second operating system into multiple LSO packets of a second LSO packet size that can be consumed by the network interface card. The multiple LSO packets are sent to the network interface card for transmission on the network in packets of a size supported by the network.

In general, the first operating system is executing on a virtual machine and the indicator of a first LSO packet size is received from a hypervisor operating on the same computing device. The virtual machine can be migrated to a second computing device and another indicator of a first LSO packet size is received from a hypervisor operating on the second computing device. The indicator of the first LSO packet size received from the hypervisor operating on the second computing device can differ from the indicator of the first LSO packet size received from the hypervisor on the computing device. Consequently, the indicator of the first LSO size received from each of the hypervisor operating on the computing device and the hypervisor operating on the second computing device can be tuned for the specific computing device's CPU usage, throughput, latency or any combination thereof.

In general, the first packet has a TCP header. The packet header from the first packet is copied to each of the packets of the second LSO packet size when they are split out.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing summary, as well as the following detailed description of preferred embodiments, is better understood when read in conjunction with the appended drawings. For the purpose of illustrating the invention, there is shown in the drawings exemplary constructions of the invention; however, the invention is not limited to the specific methods and instrumentalities disclosed. In the drawings:

FIG. 1 is a block diagram representing a computer system in which aspects of the present invention may be incorporated;

FIG. 2 illustrates a virtualized computing system environment;

FIG. 3 illustrates the communication of networking across a virtualization boundary;

FIG. 4 is a flow diagram of the packet processing in accordance with an aspect of the invention; and

FIG. 5 is a flow diagram of the processing performed by the virtual switch according to an aspect of the invention.

DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS

The inventive subject matter is described with specificity to meet statutory requirements. However, the description itself is not intended to limit the scope of this patent. Rather, the inventor has contemplated that the claimed subject matter might also be embodied in other ways, to include different combinations similar to the ones described in this document, in conjunction with other present or future technologies.

Numerous embodiments of the present invention may execute on a computer. FIG. 1 and the following discussion are intended to provide a brief general description of a suitable computing environment in which the invention may be implemented. Although not required, the invention will be described in the general context of computer executable instructions, such as program modules, being executed by a computing device, such as a client workstation or a server. Generally, program modules include routines, programs, objects, components, data structures and the like that perform particular tasks. Those skilled in the art will appreciate that the invention may be practiced with other computer system configurations, including hand held devices, multi processor systems, microprocessor based or programmable consumer electronics, network PCs, minicomputers, mainframe computers and the like. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.

Referring now to FIG. 1, an exemplary general purpose computing system is depicted. The general purpose computing system can include a conventional computer 20 or the like, including at least one processor or processing unit 21, a system memory 22, and a system bus 23 that communicatively couples various system components including the system memory to the processing unit 21 when the system is in an operational state. The system bus 23 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. The system memory can include read only memory (ROM) 24 and random access memory (RAM) 25. A basic input/output system 26 (BIOS), containing the basic routines that help to transfer information between elements within the computer 20, such as during start up, is stored in ROM 24. The computer 20 may further include a hard disk drive 27 for reading from and writing to a hard disk (not shown), a magnetic disk drive 28 for reading from or writing to a removable magnetic disk 29, and an optical disk drive 30 for reading from or writing to a removable optical disk 31 such as a CD ROM or other optical media. The hard disk drive 27, magnetic disk drive 28, and optical disk drive 30 are shown as connected to the system bus 23 by a hard disk drive interface 32, a magnetic disk drive interface 33, and an optical drive interface 34, respectively. The drives and their associated computer readable media provide non volatile storage of computer readable instructions, data structures, program modules and other data for the computer 20. Although the exemplary environment described herein employs a hard disk, a removable magnetic disk 29 and a removable optical disk 31, it should be appreciated by those skilled in the art that other types of computer readable media which can store data that is accessible by a computer, such as magnetic cassettes, flash memory cards, digital video disks, Bernoulli cartridges, random access memories (RAMs), read only memories (ROMs) and the like may also be used in the exemplary operating environment. Generally, such computer readable storage media can be used in some embodiments to store processor executable instructions embodying aspects of the present disclosure.

A number of program modules comprising computer-readable instructions may be stored on computer-readable media such as the hard disk, magnetic disk 29, optical disk 31, ROM 24 or RAM 25, including an operating system 35, one or more application programs 36, other program modules 37 and program data 38. Upon execution by the processing unit, the computer-readable instructions cause the actions described in more detail below to be carried out or cause the various program modules to be instantiated. A user may enter commands and information into the computer 20 through input devices such as a keyboard 40 and pointing device 42. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner or the like. These and other input devices are often connected to the processing unit 21 through a serial port interface 46 that is coupled to the system bus, but may be connected by other interfaces, such as a parallel port, game port or universal serial bus (USB). A display 47 or other type of display device can also be connected to the system bus 23 via an interface, such as a video adapter 48. In addition to the display 47, computers typically include other peripheral output devices (not shown), such as speakers and printers. The exemplary system of FIG. 1 also includes a host adapter 55, Small Computer System Interface (SCSI) bus 56, and an external storage device 62 connected to the SCSI bus 56.

The computer 20 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 49. The remote computer 49 may be another computer, a server, a router, a network PC, a peer device or other common network node, and typically can include many or all of the elements described above relative to the computer 20, although only a memory storage device 50 has been illustrated in FIG. 1. The logical connections depicted in FIG. 1 can include a local area network (LAN) 51 and a wide area network (WAN) 52. Such networking environments are commonplace in offices, enterprise wide computer networks, intranets and the Internet.

When used in a LAN networking environment, the computer 20 can be connected to the LAN 51 through a network interface or adapter 53. When used in a WAN networking environment, the computer 20 can typically include a modem 54 or other means for establishing communications over the wide area network 52, such as the Internet. The modem 54, which may be internal or external, can be connected to the system bus 23 via the serial port interface 46. In a networked environment, program modules depicted relative to the computer 20, or portions thereof, may be stored in the remote memory storage device. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used. Moreover, while it is envisioned that numerous embodiments of the present disclosure are particularly well-suited for computerized systems, nothing in this document is intended to limit the disclosure to such embodiments.

Referring now to FIG. 2, it depicts a high level block diagram of computer systems that can be used in embodiments of the present disclosure. As shown by the figure, computer 20 (e.g., computer system described above) can include physical hardware devices such as a storage device 208, e.g., a hard drive (such as 27 in FIG. 1), a network interface controller (NIC) 53, a graphics processing unit 234 (such as would accompany video adapter 48 from FIG. 1), at least one logical processor 212 (e.g., processing unit 21 from FIG. 1), and random access memory (RAM) 25. One skilled in the art can appreciate that while one logical processor is illustrated, in other embodiments computer 20 may have multiple logical processors, e.g., multiple execution cores per processor and/or multiple processors that could each have multiple execution cores. Depicted is a hypervisor 202 that may also be referred to in the art as a virtual machine monitor or more generally as a virtual machine manager. The hypervisor 202 in the depicted embodiment includes executable instructions for controlling and arbitrating access to the hardware of computer 20. Broadly, the hypervisor 202 can generate execution environments called partitions such as child partition 1 through child partition N (where N is an integer greater than 1). In embodiments a child partition can be considered the basic unit of isolation supported by the hypervisor 202, that is, each child partition can be mapped to a set of hardware resources, e.g., memory, devices, logical processor cycles, etc., that is under control of the hypervisor 202 and/or the parent partition. In embodiments the hypervisor 202 can be a stand-alone software product, a part of an operating system, embedded within firmware of the motherboard, specialized integrated circuits, or a combination thereof.

In the depicted example configuration, the computer 20 includes a parent partition 204 that can be configured to provide resources to guest operating systems executing in the child partitions 1-N by using virtualization service providers 228 (VSPs). In this example architecture the parent partition 204 can gate access to the underlying hardware. Broadly, the VSPs 228 can be used to multiplex the interfaces to the hardware resources by way of virtualization service clients (VSCs). Each child partition can include a virtual processor such as virtual processors 230 through 232 that guest operating systems 220 through 222 can manage and schedule threads to execute thereon. Generally, the virtual processors 230 through 232 are executable instructions and associated state information that provide a representation of a physical processor with a specific architecture. For example, one virtual machine may have a virtual processor having characteristics of an Intel x86 processor, whereas another virtual processor may have the characteristics of a PowerPC processor. The virtual processors in this example can be mapped to logical processors of the computer system such that the instructions that effectuate the virtual processors will be backed by logical processors. Thus, in these example embodiments, multiple virtual processors can be simultaneously executing while, for example, another logical processor is executing hypervisor instructions. Generally speaking, the combination of the virtual processors and various VSCs in a partition can be considered a virtual machine.

Generally, guest operating systems 220 through 222 can include any operating system such as, for example, operating systems from Microsoft®, Apple®, the open source community, etc. The guest operating systems can include user/kernel modes of operation and can have kernels that can include schedulers, memory managers, etc. Each guest operating system 220 through 222 can have associated file systems that can have applications stored thereon such as e-commerce servers, email servers, etc., and the guest operating systems themselves. The guest operating systems 220-222 can schedule threads to execute on the virtual processors 230-232 and instances of such applications can be effectuated.

FIG. 3 is a block diagram representing an exemplary virtualized computing device where a first operating system (host OS 302) controls the network interface device 53. Network interface device 53 provides access to network 300. Network interface device 53 may be, for example, a network interface card (NIC). Network interface device driver 310 provides code for accessing and controlling network interface device 53. Host network stack 330 and guest network stack 340 each provide one or more modules for processing outgoing data for transmission over network 300 and for processing incoming data that is received from network 300. Network stacks 330 and 340 may, for example, include modules for processing data in accordance with well known protocols such as Point to Point protocol (PPP), Transmission Control Protocol (TCP), and Internet Protocol (IP). Host networking application 350 and guest networking application 360 are applications executing on host operating system 204 and guest operating system 220, respectively, that access network 300.

As mentioned above, in conventional computing devices which adhere to the traditional virtualization boundary, data does not pass back and forth between virtualized operating systems. Thus, for example, in conventional configurations, when data is transferred between host networking application 350 and network 300, the data is passed directly from the host network stack 330 to the network interface device driver 310. However, in the system of FIG. 3, data does not pass directly from the host network stack 330 to the network interface device driver 310. Rather, the data is intercepted by virtual switch 325. Virtual switch 325 provides functionality according to an aspect of the invention.

Because the guest OS does not have direct access to the NIC, when the virtual NIC starts, the hypervisor advertises an LSO size to the networking stack indicating that the NIC is capable of LSO with a large packet size. LSO increases throughput by reducing the amount of processing that is necessary for smaller packet sizes. In general, large packets are given to the NIC and the NIC breaks the packets into smaller packet sizes in hardware, relieving the CPU of the work. For example, a 64 KB LSO packet is segmented into smaller segments and then sent out over the network through the NIC. By advertising an LSO packet size to the virtual NIC on the guest OS that is larger than the LSO-sized packets that are accepted by the NIC, the networking stack will pass much larger packets to the virtual NIC. The virtual NIC in turn will transfer the large packets to the virtual switch.

This causes the networking stack to format and send packets that are much larger than the MTU size supported by the underlying networking infrastructure, and much larger than the LSO packet size supported by the physical NIC to which the virtual NIC is attached. The packets are large chunks of data that are larger than a standard TCP packet, but with a TCP header. The precise LSO size is tuned to optimize for performance (CPU use, throughput, and latency), whereas previous solutions would choose the largest value expected to be supported by the underlying hardware NIC.

Normally this packet is sent all the way to the hardware as an LSO packet, or it is entirely split in software by a software LSO engine to MTU size. Instead, at some point before the packet is sent to the hardware, it is split into multiple packets, each with a maximum size no greater than that supported by the hardware's LSO engine, and the new packets are then sent to the hardware NIC. This step can occur any time before the packet is sent to hardware, but the closer to the hardware that it is performed, the better the performance.
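The first-stage split described above can be sketched as follows. This is a minimal illustration only, not the implementation of any particular virtual switch; the function name split_lso_payload and its parameters are hypothetical, and the sketch deals with payload offsets only, leaving header handling aside.

```python
def split_lso_payload(payload_len: int, nic_lso_max: int) -> list[tuple[int, int]]:
    """Return (offset, length) pairs that cover payload_len bytes.

    Every length is at most nic_lso_max, the largest LSO payload the
    hardware NIC is assumed to accept (a hypothetical parameter).
    """
    if payload_len < 0:
        raise ValueError("payload_len must not be negative")
    if nic_lso_max <= 0:
        raise ValueError("nic_lso_max must be positive")

    chunks = []
    offset = 0
    while offset < payload_len:
        # Take as much as the NIC allows, or whatever is left.
        length = min(nic_lso_max, payload_len - offset)
        chunks.append((offset, length))
        offset += length
    return chunks


if __name__ == "__main__":
    # A 256 KB first-stage packet and a NIC that accepts 62 KB LSO packets.
    for offset, length in split_lso_payload(256 * 1024, 62 * 1024):
        print(offset, length)
```

With these example sizes, the 256 KB packet yields four 62 KB sub-packets followed by one 8 KB sub-packet. A practical implementation might additionally round the sub-packet payload down to a multiple of the TCP maximum segment size; that refinement is an assumption and is not shown here.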

This is accomplished with an LSO algorithm, by copying the packet headers to each sub-packet and adjusting the TCP sequence number, identification field (for IPv4), and header flags. Preferably, the IP or TCP checksums are not calculated as is normally required by LSO, because that will be performed by the hardware NIC. Similarly, the length field in the IP headers is not updated, nor is the TCP pseudo-checksum, as this would interfere with the NIC's later computation of these fields while performing hardware LSO.
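The header adjustments named above can be sketched as follows. The text does not state by how much each field is adjusted, so the following are assumptions made only for illustration: the TCP sequence number advances by the payload bytes carried in earlier sub-packets, the IPv4 identification advances by the number of MSS-sized segments the NIC is expected to emit for earlier sub-packets, and the FIN and PSH flags are kept only on the last sub-packet. The Headers type and the function name are hypothetical. Checksums and the IP length field are deliberately left untouched, as the paragraph describes.

```python
from dataclasses import dataclass, replace

TCP_FIN = 0x01
TCP_PSH = 0x08


@dataclass(frozen=True)
class Headers:
    """Only the header fields this sketch adjusts (hypothetical type)."""
    tcp_seq: int    # 32-bit TCP sequence number
    ipv4_id: int    # 16-bit IPv4 identification
    tcp_flags: int  # TCP flag bits


def sub_packet_headers(base: Headers, chunk_lengths: list[int], mss: int) -> list[Headers]:
    """Copy the base headers once per sub-packet and adjust seq, ID and flags."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    result = []
    payload_before = 0   # payload bytes carried by earlier sub-packets
    segments_before = 0  # MSS-sized segments the NIC emits for earlier sub-packets
    last = len(chunk_lengths) - 1
    for index, length in enumerate(chunk_lengths):
        flags = base.tcp_flags
        if index != last:
            # Assumption: FIN and PSH belong only on the final sub-packet.
            flags &= ~(TCP_FIN | TCP_PSH)
        result.append(replace(
            base,
            tcp_seq=(base.tcp_seq + payload_before) & 0xFFFFFFFF,
            ipv4_id=(base.ipv4_id + segments_before) & 0xFFFF,
            tcp_flags=flags,
        ))
        payload_before += length
        segments_before += -(-length // mss)  # ceiling division
    return result
```

Both counters wrap at their field widths (32 bits for the sequence number, 16 bits for the identification), which is why the results are masked.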

Finally, the software LSO driver must wait to complete the full packet to the sender until all sub-packets have been sent by the NIC and are completed. This is achieved by keeping a count of outstanding sub-packets that have not yet completed, and completing the full packet when this count reaches zero.
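The completion counting described above can be sketched as follows. The class name ParentPacket, its methods, and the use of a lock are illustrative assumptions; the text specifies only that a count of outstanding sub-packets is kept and that the full packet is completed when the count reaches zero.

```python
import threading
from typing import Callable


class ParentPacket:
    """Tracks the sub-packets of one oversized LSO packet (hypothetical type)."""

    def __init__(self, sub_packet_count: int, on_complete: Callable[[], None]):
        if sub_packet_count <= 0:
            raise ValueError("sub_packet_count must be positive")
        self._outstanding = sub_packet_count
        self._on_complete = on_complete
        self._lock = threading.Lock()  # completions may arrive concurrently

    def sub_packet_completed(self) -> bool:
        """Record one NIC completion; return True if it was the last one."""
        with self._lock:
            if self._outstanding == 0:
                raise RuntimeError("more completions than sub-packets")
            self._outstanding -= 1
            finished = self._outstanding == 0
        if finished:
            # Complete the full packet to the sender exactly once.
            self._on_complete()
        return finished
```

The callback is invoked outside the lock so that completing the full packet to the sender cannot block other completions.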

FIG. 4, in conjunction with FIG. 3, demonstrates in more detail the flow described above. In particular, at 402, the data stream from the network stack 340 arrives at the virtual NIC driver 342. At 404, the virtual NIC driver 342 configures a packet that is as large as the hypervisor will allow. The LSO size of the packet is preferably much larger than the LSO packet size supported by NIC 53 in the host partition. The virtual NIC driver 342 then transfers the data to the virtual switch 325 by communication services provided by the hypervisor. At 406, the virtual switch 325 then splits the large-format LSO packet into LSO packets that conform to the LSO packet size 408 supported by the NIC hardware 53.

At 410, the LSO engine of the NIC hardware 53 splits the LSO packets into MTU-sized packets 412 supported by the network infrastructure. Those packets are then transmitted over the network.
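As a worked illustration of the two passes, the following sketch counts how many sub-packets the first stage hands to the NIC and how many segments the NIC then places on the wire. The function name and parameters are hypothetical, and the sizes in the example are assumptions: a 256 KB first-stage packet, a 62 KB NIC LSO limit, and a 1460-byte maximum segment size (a 1500-byte MTU less 20 bytes each of IP and TCP header without options).

```python
def two_stage_counts(payload_len: int, nic_lso_max: int, mss: int) -> tuple[int, int]:
    """Return (sub_packets, wire_segments) for a two-pass LSO split.

    Stage one cuts the payload into chunks of at most nic_lso_max bytes;
    stage two (the NIC) cuts each chunk into segments of at most mss bytes.
    """
    if payload_len <= 0 or nic_lso_max <= 0 or mss <= 0:
        raise ValueError("all sizes must be positive")

    sub_packets = 0
    wire_segments = 0
    remaining = payload_len
    while remaining > 0:
        chunk = min(nic_lso_max, remaining)
        sub_packets += 1
        wire_segments += -(-chunk // mss)  # ceiling division
        remaining -= chunk
    return sub_packets, wire_segments


if __name__ == "__main__":
    print(two_stage_counts(256 * 1024, 62 * 1024, 1460))  # (5, 182)
```

In this example the network stack handles one packet, the NIC receives five, and 182 segments are transmitted. A single-pass split of the same payload would produce 180 segments; the two extra segments arise because 62 KB is not a multiple of 1460 bytes, which suggests that an implementation might choose a sub-packet size that is a multiple of the segment size. The text does not address this point.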

FIG. 5 further illustrates the processing performed in virtual switch 325 of FIG. 3. At 505, virtual switch 325 receives the packet from virtual NIC driver 342. Thereafter, at 507, virtual switch 325 determines whether the NIC hardware 53 supports LSO. If LSO is supported, virtual switch 325 determines whether the packet is smaller than the LSO packet size supported by the NIC. If yes, at 511, the packet is sent to the NIC hardware 53 without further processing. If no, at 511, the oversized LSO packet is subdivided into LSO packets of a size supported by the NIC. On the other hand, if it is determined at 507 that LSO is not supported by the NIC, then the packets are divided into NIC-supported packets at 511.
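The decision flow of FIG. 5 can be sketched as follows. The enumeration and function names are hypothetical. The sketch assumes that a packet exactly equal to the NIC's LSO size is passed through unchanged, and that "NIC-supported packets" for a NIC without LSO means MTU-sized packets produced in software; neither detail is stated explicitly in the text.

```python
from enum import Enum


class Action(Enum):
    PASS_THROUGH = "send to the NIC unchanged"
    SPLIT_TO_NIC_LSO = "subdivide into NIC-sized LSO packets"
    SPLIT_TO_MTU = "segment in software into MTU-sized packets"


def choose_action(packet_len: int, nic_supports_lso: bool, nic_lso_max: int) -> Action:
    """Decide how the virtual switch handles one packet from the virtual NIC."""
    if not nic_supports_lso:
        # No hardware LSO: the switch must produce packets the NIC can send as-is.
        return Action.SPLIT_TO_MTU
    if packet_len <= nic_lso_max:
        # Already small enough for the hardware LSO engine.
        return Action.PASS_THROUGH
    # Oversized LSO packet: first-stage split, the NIC performs the second stage.
    return Action.SPLIT_TO_NIC_LSO
```

For example, choose_action(256 * 1024, True, 62 * 1024) selects the first-stage split, while the same packet on a NIC without LSO support falls back to full software segmentation.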

The techniques described allow the virtual machine to be migrated from one system to another and maximize the performance on each system, preferably tailored to the NIC hardware on each system. To that end, when the virtual NIC driver loads on the target system, the hypervisor provides an LSO packet size that is then used to send the maximum-sized packet to the partition that controls the NIC hardware. This allows an oversized packet size to be determined for each system based on maximizing throughput or other parameters that may be desirable on the target system.
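The per-system size described above can be sketched as follows. All names are hypothetical, and the per-host multiple is an assumed tuning knob; the only constraint taken from the summary is that the first LSO packet size is a multiple of the LSO packet size supported by the NIC. How a hypervisor would actually choose the multiple (for example by measuring CPU use, throughput, or latency) is outside this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostProfile:
    """What a hypervisor is assumed to know about one host (hypothetical type)."""
    name: str
    nic_lso_size: int  # bytes accepted by the physical NIC's LSO engine
    multiple: int      # assumed per-host tuning value


def advertised_lso_size(host: HostProfile) -> int:
    """First-stage LSO size: a whole multiple of the NIC's LSO size."""
    if host.nic_lso_size <= 0 or host.multiple < 1:
        raise ValueError("invalid host profile")
    return host.nic_lso_size * host.multiple


class VirtualNic:
    """Guest-side view: the LSO size is re-read whenever the driver loads."""

    def __init__(self, host: HostProfile):
        self.lso_size = advertised_lso_size(host)

    def migrate(self, new_host: HostProfile) -> None:
        # On the target system the hypervisor may advertise a different size.
        self.lso_size = advertised_lso_size(new_host)


if __name__ == "__main__":
    vnic = VirtualNic(HostProfile("host-a", 62 * 1024, 4))
    print(vnic.lso_size)  # 253952
    vnic.migrate(HostProfile("host-b", 64 * 1024, 8))
    print(vnic.lso_size)  # 524288
```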

The various systems, methods, and techniques described herein may be implemented with hardware or software or, where appropriate, with a combination of both. Thus, the methods and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium, wherein, when the program code is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention. In the case of program code execution on programmable computers, the computer will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. One or more programs are preferably implemented in a high level procedural or object oriented programming language to communicate with a computer system. However, the program(s) can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language, and combined with hardware implementations.

The methods and apparatus of the present invention may also be embodied in the form of program code that is transmitted over some transmission medium, such as over electrical wiring or cabling, through fiber optics, or via any other form of transmission, wherein, when the program code is received and loaded into and executed by a machine, such as an EPROM, a gate array, a programmable logic device (PLD), a client computer, a video recorder or the like, the machine becomes an apparatus for practicing the invention. When implemented on a general-purpose processor, the program code combines with the processor to provide a unique apparatus that operates to perform the indexing functionality of the present invention.

Consequently, the network stack can send very large packets, larger than a physical NIC normally supports with LSO. This is accomplished by performing multi-pass LSO; a first-stage LSO switch is inserted somewhere between the network stack and the physical NIC that splits very large LSO packets into LSO packets that are small enough for the NIC. The NIC then performs a second pass of LSO by splitting these sub-packets into standard MTU-sized networking packets for transmission on the network.

While the present invention has been described in connection with the preferred embodiments of the various figures, it is to be understood that other similar embodiments may be used or modifications and additions may be made to the described embodiment for performing the same function of the present invention without deviating therefrom. For example, while exemplary embodiments of the invention are described in the context of digital devices emulating the functionality of personal computers, one skilled in the art will recognize that the present invention is not limited to such digital devices, as described in the present application, and may apply to any number of existing or emerging computing devices or environments, such as a gaming console, handheld computer, portable computer, etc., whether wired or wireless, and may be applied to any number of such computing devices connected via a communications network, and interacting across the network. Furthermore, it should be emphasized that a variety of computer platforms, including handheld device operating systems and other application specific hardware/software interface systems, are herein contemplated, especially as the number of wireless networked devices continues to proliferate. Therefore, the present invention should not be limited to any single embodiment, but rather construed in breadth and scope in accordance with the appended claims.

Finally, the disclosed embodiments described herein may be adapted for use in other processor architectures, computer-based systems, or system virtualizations, and such embodiments are expressly anticipated by the disclosures made herein and, thus, the present invention should not be limited to specific embodiments described herein but instead construed most broadly. Likewise, the use of synthetic instructions for purposes other than processor virtualization is also anticipated by the disclosures made herein, and any such utilization of synthetic instructions in contexts other than processor virtualization should be most broadly read into the disclosures made herein.

What is claimed:
 1. A method for transmitting packets over a network, comprising: receiving at a first operating system operating on a computing device, an indicator of a first large segment offload (LSO) packet size wherein the first LSO packet size is a multiple of a second LSO packet size that is supported by a network interface card connected to the computing device; formatting data into a first packet of a first LSO packet size; transferring the first packet to a second operating system on the same computing device; splitting the first packet on the second operating system into multiple LSO packets of a second LSO packet size; and sending the multiple LSO packets to the network interface card for transmission on the network in packets of a size supported by the network.